Multi-step Reinforcement Learning: A Unifying Algorithm
Abstract
Unifying seemingly disparate algorithmic ideas to produce better-performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there is a multitude of algorithms that can be used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa. These methods are often studied in the one-step case, but they can be extended across multiple time steps to achieve better performance. Each of these algorithms is seemingly distinct, and no one dominates the others for all problems. In this paper, we study a new multi-step action-value algorithm called Q(σ), which unifies and generalizes these existing algorithms while subsuming them as special cases. A new parameter, σ, is introduced to allow the degree of sampling performed by the algorithm at each step during its backup to be continuously varied, with Sarsa existing at one extreme (full sampling) and Expected Sarsa existing at the other (pure expectation). Q(σ) is generally applicable to both on- and off-policy learning, but in this work we focus on experiments in the on-policy case. Our results show that an intermediate value of σ, which results in a mixture of the existing algorithms, performs better than either extreme. The mixture can also be varied dynamically, which can result in even better performance.
1. The Landscape of TD Algorithms
Temporal-difference (TD) methods (Sutton, 1988) are an important concept in reinforcement learning (RL) that combines ideas from Monte Carlo and dynamic programming methods. TD methods allow learning to occur directly from raw experience in the absence of a model of the environment's dynamics, as with Monte Carlo methods, while also allowing estimates to be updated based on other learned estimates without waiting for a final result, as with dynamic programming.
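To make the bootstrapping idea concrete, here is a minimal sketch of a tabular one-step TD prediction update. It is not taken from the paper: the function name `td0_update`, the dictionary-based value table, and the step size `alpha` and discount `gamma` defaults are all illustrative assumptions.

```python
from collections import defaultdict


def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one tabular TD(0) update to V[s] and return the TD error.

    Hypothetical helper for illustration: V maps states to value estimates.
    """
    # Bootstrap from the current estimate of the next state
    # (zero if the episode ended on this transition).
    bootstrap = 0.0 if terminal else V[s_next]
    td_error = r + gamma * bootstrap - V[s]
    # Move the estimate a fraction alpha toward the one-step target.
    V[s] += alpha * td_error
    return td_error


V = defaultdict(float)
V["B"] = 1.0
delta = td0_update(V, "A", r=0.0, s_next="B")
```

The update uses only the observed transition and the current estimate of the next state, so it needs neither a model of the dynamics nor the final outcome of the episode.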
The core concepts of TD methods provide a flexible framework for creating a variety of powerful algorithms that can be used for both prediction and control. A number of TD control methods have been proposed. Q-learning (Watkins, 1989) is arguably the most popular, and is considered an off-policy method because the policy generating the behaviour (the behaviour policy) and the policy that is being learned (the target policy) are different. Sarsa (Rummery & Niranjan, 1994; Sutton, 1996) is the classical on-policy control method, where the behaviour and target policies are the same. However, Sarsa can be extended to learn off-policy with the use of importance sampling (Precup, Sutton, & Singh, 2000). Expected Sarsa is an extension of Sarsa that, instead of using the action-value of the sampled next state-action pair to update the value of the current pair, uses the expectation of all the action-values of the next state with respect to the target policy. Expected Sarsa has been studied as a strictly on-policy method (van Seijen, van Hasselt, Whiteson, & Wiering, 2009), but in this paper we present a more general version that can be used for both on- and off-policy learning and that also subsumes Q-learning. All of these methods are often described in the simple one-step case, but they can also be extended across multiple time steps.
The TD(λ) algorithm unifies one-step TD learning with Monte Carlo methods (Sutton, 1988). Through the use of eligibility traces and the trace-decay parameter, λ ∈ [0, 1], a spectrum of algorithms is created. At one end, λ = 1, are Monte Carlo methods, and at the other, λ = 0, is one-step TD learning. In the middle of the spectrum are intermediate methods which can perform better than the methods at either extreme (Sutton & Barto, 1998).
arXiv:1703.01327v1 [cs.AI] 3 Mar 2017
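The one-step control methods named in this section differ mainly in the target they back up from the next state. The sketch below writes the three standard one-step targets side by side. The function names and the list-based representation of the next state's action values (`q_next`) and target-policy probabilities (`pi_next`) are assumptions made for illustration, not notation from the paper.

```python
def sarsa_target(q_next, a_next, r, gamma):
    """Sarsa: bootstrap from the action actually sampled in the next state."""
    return r + gamma * q_next[a_next]


def q_learning_target(q_next, r, gamma):
    """Q-learning: bootstrap from the maximum next-state action value."""
    return r + gamma * max(q_next)


def expected_sarsa_target(q_next, pi_next, r, gamma):
    """Expected Sarsa: bootstrap from the expectation under the target policy."""
    return r + gamma * sum(p * q for p, q in zip(pi_next, q_next))


q_next = [1.0, 3.0]    # illustrative action values in the next state
uniform = [0.5, 0.5]   # a stochastic target policy
greedy = [0.0, 1.0]    # a target policy that is greedy w.r.t. q_next

t_sarsa = sarsa_target(q_next, a_next=0, r=1.0, gamma=0.9)
t_qlearn = q_learning_target(q_next, r=1.0, gamma=0.9)
t_expected = expected_sarsa_target(q_next, uniform, r=1.0, gamma=0.9)
t_expected_greedy = expected_sarsa_target(q_next, greedy, r=1.0, gamma=0.9)
```

With a greedy target policy the Expected Sarsa target coincides with the Q-learning target, which is one way to read the statement that the general version subsumes Q-learning.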
The concept of eligibility traces can also be applied to TD control methods such as Sarsa and Q-learning, which can create more efficient learning and produce better performance (Rummery, 1995). Multi-step TD methods are usually thought of in terms of an average of many multi-step returns of differing lengths and are often associated with eligibility traces, as is the case with TD(λ). However, it is also natural to think of them in terms of individual n-step returns with their associated n-step backup (Sutton & Barto, 1998). We refer to each of these individual backups as atomic backups, whereas the combination of several atomic backups of different lengths creates a compound backup.
In the existing literature, it is not clear how best to extend one-step Expected Sarsa to a multi-step algorithm. The Tree-backup algorithm was originally presented as a method to perform off-policy evaluation when the behaviour policy is non-Markov, non-stationary, or completely unknown (Precup et al., 2000). In this paper, we re-present Tree-backup as a natural multi-step extension of Expected Sarsa. Instead of performing the updates with entirely sampled transitions as with multi-step Sarsa, Tree-backup performs the update using the expected values of all the actions at each transition.
Q(σ) is an algorithm first proposed by Sutton and Barto (2017) that unifies and generalizes the existing multi-step TD control methods. The degree of sampling performed by the algorithm is controlled by the sampling parameter, σ. At one extreme (σ = 1) is Sarsa (full sampling), and at the other (σ = 0) is Tree-backup (pure expectation). Intermediate values of σ create algorithms with a mixture of sampling and expectation. In this work we show that an intermediate value of σ can outperform the algorithms that exist at either extreme. In addition, we show that σ can be varied dynamically to produce even better performance.
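The role of σ is easiest to see in the one-step case, where the backup target is a σ-weighted mixture of the sampled next action value and its expectation under the target policy. The following is a sketch of that one-step target only; the multi-step algorithm studied in the paper is not reproduced in this excerpt, and the function name and list-based arguments are illustrative assumptions.

```python
def q_sigma_target(q_next, pi_next, a_next, r, gamma, sigma):
    """One-step Q(sigma) target: mix a sampled backup with an expected backup.

    sigma = 1 recovers the Sarsa target (full sampling);
    sigma = 0 recovers the Expected Sarsa target (pure expectation).
    """
    sampled = q_next[a_next]
    expected = sum(p * q for p, q in zip(pi_next, q_next))
    return r + gamma * (sigma * sampled + (1.0 - sigma) * expected)


q_next = [1.0, 3.0]   # illustrative next-state action values
pi_next = [0.5, 0.5]  # illustrative target-policy probabilities

full_sampling = q_sigma_target(q_next, pi_next, 0, r=1.0, gamma=0.9, sigma=1.0)
pure_expectation = q_sigma_target(q_next, pi_next, 0, r=1.0, gamma=0.9, sigma=0.0)
mixture = q_sigma_target(q_next, pi_next, 0, r=1.0, gamma=0.9, sigma=0.5)
```

Because σ is just an argument here, it could be set per step, which is the kind of dynamic variation the text mentions; how best to schedule it is not specified in this excerpt.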
We limit our discussion of Q(σ) to the atomic multi-step case without eligibility traces; a natural extension is to use compound backups, which is an avenue for future research. Furthermore, Q(σ) is generally applicable to both on- and off-policy learning, but for our initial empirical study we examined only on-policy prediction and control problems.
2. MDPs and One-step Solution Methods
The sequential decision problem encountered in RL is often modeled as a Markov decision process (MDP). Under this framework, an agent and the environment interact over a sequence of discrete time steps $t$. At every time step, the agent receives information about the environment's state, $S_t \in \mathcal{S}$, where $\mathcal{S}$ is the set of all possible states. The agent uses this information to select an action, $A_t$, from the set of all possible actions $\mathcal{A}$. Based on the behaviour of the agent and the state of the environment, the agent receives a reward, $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$, and moves to another state, $S_{t+1} \in \mathcal{S}$, with a state-transition probability $p(s' \mid s, a) = P(S_{t+1} = s' \mid S_t = s, A_t = a)$, for $a \in \mathcal{A}$ and $s, s' \in \mathcal{S}$. The agent behaves according to a policy $\pi(a \mid s)$, which is a probability distribution over the set $\mathcal{S} \times \mathcal{A}$. Through the process of policy iteration (Sutton & Barto, 1998), the agent learns the optimal policy, $\pi_*$, that maximizes the expected discounted return: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum^{T-1} \dots$
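As a small worked example of the discounted return written out above, the sketch below accumulates a finite reward sequence from the last reward backwards. The function name `discounted_return` and the sample rewards are assumptions for illustration only.

```python
def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite list.

    rewards[0] plays the role of R_{t+1}; the list ends at the final reward
    of the episode.
    """
    g = 0.0
    # Work backwards so each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


g_example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)  # 1 + 0.5 + 0.25
```

The backward recursion is the same relation that one-step TD methods exploit: the return at one step equals the immediate reward plus the discounted return from the next step.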
Similar resources
Double Q($\sigma$) and Q($\sigma, \lambda$): Unifying Reinforcement Learning Control Algorithms
Temporal-difference (TD) learning is an important field in reinforcement learning. Sarsa and Q-Learning are among the most used TD algorithms. The Q(σ) algorithm (Sutton and Barto (2017)) unifies both. This paper extends the Q(σ) algorithm to an online multi-step algorithm Q(σ, λ) using eligibility traces and introduces Double Q(σ) as the extension of Q(σ) to double learning. Experiments sugges...
Dynamic Obstacle Avoidance by Distributed Algorithm based on Reinforcement Learning (RESEARCH NOTE)
In this paper we focus on the application of reinforcement learning to obstacle avoidance in dynamic Environments in wireless sensor networks. A distributed algorithm based on reinforcement learning is developed for sensor networks to guide mobile robot through the dynamic obstacles. The sensor network models the danger of the area under coverage as obstacles, and has the property of adoption o...
Development of Reinforcement Learning Algorithm to Study the Capacity Withholding in Electricity Energy Markets
This paper addresses the possibility of capacity withholding by energy producers, who seek to increase the market price and their own profits. The energy market is simulated as an iterative game, where each state game corresponds to an hourly energy auction with uniform pricing mechanism. The producers are modeled as agents that interact with their environment through reinforcement learning (RL...
Unifying Task Specification in Reinforcement Learning
Reinforcement learning tasks are typically specified as Markov decision processes. This formalism has been highly successful, though specifications often couple the dynamics of the environment and the learning objective. This lack of modularity can complicate generalization of the task specification, as well as obfuscate connections between different task settings, such as episodic and continui...
Low-Area/Low-Power CMOS Op-Amps Design Based on Total Optimality Index Using Reinforcement Learning Approach
This paper presents the application of reinforcement learning in automatic analog IC design. In this work, the Multi-Objective approach by Learning Automata is evaluated for accommodating required functionalities and performance specifications considering optimal minimizing of MOSFETs area and power consumption for two famous CMOS op-amps. The results show the ability of the proposed method to ...
Journal title:
- CoRR
Volume: abs/1703.01327
Pages: -
Publication date: 2017